This is a playground for experimenting with something called R Markdown.
At first glance, R Markdown appears to be a fairly simple mark-up language whose raison d’être is to enable the embedding of “live” R code into web pages and other hypertext-compatible documents.
Some useful links are listed below:
R Markdown basics (same link as on MOOC)
Embedding R code into a document (same link as on MOOC)
R Markdown cheatsheet (a different one!)
Regular expression cheatsheet (we linguists use regexes a lot!)
The theme used for this page is called readable.
You can use read_csv() to read a comma-separated table (.csv) into R, but it does require prior installation of the readr package. If you have RStudio, I warmly recommend using the interactive ‘Import Dataset’ menu instead.
As an alternative, read.table() is more generic than read.csv(), and both come with base R – no extra packages needed. You do have to specify more details with read.table(), such as the cell separator character (sep=",").
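To illustrate the difference, here is a minimal sketch on a tiny made-up table (not the course data): read.table() needs the separator and header spelled out, while read.csv() supplies those defaults for you.

```r
# A tiny hypothetical .csv, written to a temporary file for demonstration
tmp <- tempfile(fileext = ".csv")
writeLines(c("gender,age,points", "F,21,25", "M,23,19"), tmp)

d  <- read.table(tmp, sep = ",", header = TRUE)  # sep and header must be given
d2 <- read.csv(tmp)                              # same result, fewer arguments
```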
library(readr)
learning2014 <- read_csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt")
Here are some ways to overview the data, which is always a good first step:
Show the dimensions of the data frame:
dim(learning2014)
## [1] 166 7
Print a summary of its variables (columns):
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
The following command opens a new window (in RStudio, a new tab in the source pane) in which you can interactively browse and explore the dataset:
View(learning2014)
This dataset was apparently collected to investigate links between academic performance (variable “points”) and various personality and demographic factors. As is customary, each row represents one person from whom these variables were measured.
The following libraries are needed for the prettiest charts and graphs:
library(ggplot2)
library(GGally)
Let’s print out a hugely informative array of plots:
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
This ggpairs() is a more sophisticated alternative to base R’s pairs(). What both do is draw a scatter plot for each pair of variables found in the dataframe (or just the pairs that the user specifies).
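For reference, a base-R pairs() call looks like this – sketched on a small synthetic stand-in frame, since learning2014 itself is downloaded above:

```r
# Synthetic stand-in for learning2014 (made-up values, same idea)
set.seed(1)
df <- data.frame(attitude = runif(50, 1, 5),
                 age      = sample(17:55, 50, replace = TRUE))
df$points <- 10 + 3 * df$attitude + rnorm(50, sd = 5)

pairs(df)  # one scatter plot per variable pair; no colouring by group
```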
No highly obvious correlations emerge from the scatter plots. One predictable association can be discerned – namely, the one between exam scores and attitude towards the subject. The plot also seems to suggest a very slight negative correlation between age and points. Thirdly, a glance at the correlation coefficients reveals that surf has the second-highest absolute correlation with points (distance from zero) after attitude. We’ll include these three explanatory variables in our linear regression analysis.
Linear regression models the relationship between a continuous outcome variable (AKA regressand or y variable) and one or more input variables, or predictors or x-variables, which can likewise be continuous but do not have to be. In linear regression we model the outcome as the sum of each predictor multiplied by its regression coefficient, plus an error term representing the distance by which the model fails to match the observed values of the outcome variable.
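A worked sketch may make this concrete. With made-up data where the true coefficients are known, lm() should recover values close to the truth:

```r
# Outcome built as intercept + slope * predictor + random error
set.seed(42)
x <- runif(200, 1, 5)
y <- 10 + 3 * x + rnorm(200, sd = 2)  # true intercept 10, true slope 3

fit <- lm(y ~ x)
coef(fit)  # estimates should land close to 10 and 3
```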
summary(lm(points ~ attitude + age + surf, data=learning2014))
##
## Call:
## lm(formula = points ~ attitude + age + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.973 -3.430 0.167 3.997 10.444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.85628 3.53786 4.765 4.18e-06 ***
## attitude 3.42388 0.57353 5.970 1.45e-08 ***
## age -0.08713 0.05361 -1.625 0.106
## surf -0.96049 0.79943 -1.201 0.231
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.294 on 162 degrees of freedom
## Multiple R-squared: 0.2082, Adjusted R-squared: 0.1935
## F-statistic: 14.2 on 3 and 162 DF, p-value: 2.921e-08
Perhaps the most important figures in this printout are the coefficient estimates and their significance values (Pr(>|t|)).
As common sense suggests, attitude shows the strongest correlation with points. The effect is statistically significant at the highest level. Age and surface learning also exert an effect, but their respective effects do not reach the required .05 level of statistical significance. It is very important to note, however, that age comes much closer to statistical significance than surf even though the absolute value of its regression coefficient is much smaller. What’s going on?
(Take a moment to think before reading on.)
These figures reflect the fact that surf and age are measured on different scales. Surface learning ranges only from 1 to 5, while age (in our dataset) ranges from 17 to 55. Although a change by one single unit in surf is more consequential than a change of one year in age, the latter variable has much more explanatory power when its entire range of variation is taken into consideration.
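This can be checked directly: standardizing predictors with scale() puts their coefficients on a common per-standard-deviation footing, so a narrow-range variable no longer looks stronger per unit than a wide-range one. A sketch on synthetic stand-in data (made-up coefficients, not the course data):

```r
# Stand-in data mimicking the ranges of age (17-55) and surf (1-5)
set.seed(7)
d <- data.frame(age  = sample(17:55, 300, replace = TRUE),
                surf = runif(300, 1, 5))
d$points <- 25 - 0.1 * d$age - 2 * d$surf + rnorm(300, sd = 3)

coef(lm(points ~ age + surf, data = d))                # raw, per-unit scales
coef(lm(points ~ scale(age) + scale(surf), data = d))  # comparable per-SD scales
```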
Remember that my eye caught the correlative pattern in the points-by-age scatter plot but not in the points-by-surf one? This is a good example of how graphical presentations of data are often more informative than cold numbers, demonstrating the wisdom of Kimmo’s “always draw plots” principle.
However, both age and surf must now be dropped from the model due to their statistically non-significant effects.
summary(lm(points ~ attitude, data=learning2014))
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The Residuals table quantifies the error term. For example, the worst overestimation by the model is almost 17 points (this student massively underachieved compared to the model’s prediction), while the worst underestimation is about 10.7 points (this student exceeded the model’s expectations by that many exam points).
Intercept reports the point at which the regression line intersects the Y axis. We might also think of it as the predicted value of the outcome variable when all explanatory variables equal zero.
I was not entirely sure at first what the Std. Error column measures. It is the standard error of the coefficient estimate itself – a measure of how much the estimated coefficient would be expected to vary from sample to sample.
The column entitled t value reports the t test statistic, which is simply the Estimate divided by its Std. Error. Bigger (in absolute value) is better, but the t-test always requires the degrees of freedom to be factored into the equation. It is thus not an entirely straightforward indicator of predictor significance – that is what the Pr(>|t|) value is for. If the Pr(>|t|) value has either a dot or asterisks next to it, then that predictor is considered statistically significant at the corresponding level.
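The Estimate-over-Std.-Error relationship can be verified on any fitted model; a sketch with made-up data:

```r
# The t value column of summary() is Estimate / Std. Error, row by row
set.seed(3)
x <- runif(100)
y <- 2 * x + rnorm(100)

ct <- summary(lm(y ~ x))$coefficients
ct[, "t value"]                        # as reported by summary()
ct[, "Estimate"] / ct[, "Std. Error"]  # identical by construction
```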
Residual standard error is essentially the standard deviation of the residuals (adjusted for the degrees of freedom) – a measure of how far the observations typically fall from the fitted line.
Multiple R-squared indicates the proportion of the overall variation in points that can be jointly predicted by all the predictors included in the model. Here, attitude is the only one. Adjusted R-squared is an estimate of R-squared that penalizes for the number of predictors – therefore its value here (with only one predictor) is nearly identical to the unadjusted figure.
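The “proportion of variation explained” reading can be computed by hand as 1 − RSS/TSS; a sketch with made-up data:

```r
# Multiple R-squared = 1 - (residual sum of squares / total sum of squares)
set.seed(5)
x <- runif(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)       # variation the model fails to explain
tss <- sum((y - mean(y))^2)    # total variation in the outcome
c(manual = 1 - rss / tss, summary = summary(fit)$r.squared)
```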
The F-statistic tests the highly skeptical hypothesis that the model has zero predictive power. This hypothesis usually, as here, gets rejected, as indicated by the extremely small p-value.
plot(lm(points ~ attitude, data=learning2014),which=1)
The Residuals vs Fitted plot is used to check whether there is any systematic pattern to the model error. Such a pattern would be bad news because it would cast doubt on one of the crucial assumptions of linear regression, namely that the residuals vary randomly with constant variance. Here we can see slight heteroscedasticity in the model, i.e. the model appears to predict high exam scores slightly more reliably than low ones. This pattern is weak and only barely discernible, though.
plot(lm(points ~ attitude, data=learning2014),which=2)
Q-Q plots are used to analyze the validity of another model assumption – that the residuals of the model are normally distributed. In an ideal case, the observations (dots) would align neatly along a straight line. Now we can see that this is not entirely the case – large negative residuals are more frequent than expected, while large positive residuals are less frequent than expected.
plot(lm(points ~ attitude, data=learning2014),which=5)
This plot illustrates Leverage, i.e. how influential each observation (data point) is in determining the x-variable’s regression coefficient. The idea is to ascertain that there are no outliers skewing the model, i.e. exerting a massive impact on the angle of the regression line. Any highly unusual observation at either extreme of the x-variable’s distribution is potentially an influential outlier. In our case, we can see that no hugely influential observations exist – the scale of the leverage goes only from 0 to about .05, so there are practically no outliers with disproportionate leverage.
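The same question the plot answers visually can be checked numerically with Cook’s distance; a sketch on made-up data:

```r
# cooks.distance() measures each observation's influence on the fit
set.seed(9)
x <- runif(80, 1, 5)
y <- 11 + 3.5 * x + rnorm(80, sd = 5)
fit <- lm(y ~ x)

max(cooks.distance(fit))  # values far below 1 suggest no worrying outliers
```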
Welcome to the logistic regression chapter! Here we will practise predicting categorical outcomes, which are trickier to model than continuous ones because they require us to calculate probabilities. Probabilities do not range between negative and positive infinity like continuous response variables tend to, so in order to make the regression coefficients more analogous to those in linear regression, we need a link function in between. We’ll start by importing our training dataset into R:
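The link function in question is the logit. A quick sketch of what it does: qlogis() maps probabilities from (0, 1) onto the whole real line of log-odds, and plogis() maps them back.

```r
# The logit link and its inverse, on a few example probabilities
p <- c(0.1, 0.5, 0.9)
qlogis(p)          # log-odds: roughly -2.20, 0, 2.20
plogis(qlogis(p))  # back to the original probabilities
```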
alcohol<-read.table(file = "drunkards.txt", sep = "\t", header = TRUE, row.names = NULL)
This data was collected to analyze correlations between alcohol consumption and various social and demographic factors.
It was already seen in the Datacamp exercise that sex correlates heavily with drinking habits. This is unsurprising. The reasons behind the phenomenon are rather obvious and need not be discussed here. Therefore, I will leave this variable out of my own analysis and instead choose the following four: age, reason, studytime and romantic.
Let’s see what we can find out.
table(alcohol$age)
##
## 15 16 17 18 19 20 22
## 81 107 100 81 11 1 1
I chose age as one of the variables because I wanted some of my variables to be continuous. There are other continuous variables in the data – such as grades and rates of failure and absence. However, I felt that these variables present serious chicken-or-egg problems for the interpretation of results: even if bad grades are found to correlate with high alcohol use, can we really claim that the bad grades came first, i.e., caused the drinking and not the other way around? I find such a premise questionable. Age has no such problems: it is always the cause, never the effect (everyone grows older at the same rate regardless of their drinking habits).
There are only 7 age groups in the dataset, two of which are so under-represented as to be basically negligible. There are very few students over the age of 18 in the dataset. Findings based on 11 observations or fewer are rarely statistically significant. We should be able to get something out of the data on the other age groups though.
We will start drawing pictures now. Let us therefore load ggplot2 into memory, as it seems to be more flexible and customizable than base R’s graphics library.
library(ggplot2)
First, let’s look at a bar chart of high alcohol use as a function of age:
ggplot(data = alcohol, aes(age, fill = high_use)) +
  geom_bar(position = "dodge") +
  ylab("count") +
  xlab("age") +
  scale_x_continuous(breaks = 15:22) +
  ggtitle("High alcohol consumption as a function of age", subtitle = "N = 382")
This is not a bad way to visualize the frequency of a phenomenon across groups. We can see that the proportion of heavy drinking seems highest in the 17-year-olds. However, since the age groups vary in size, I find comparison easier using a percent-stacked bar chart instead. The tradeoff is that we lose sight of the absolute group sizes:
ggplot(data = alcohol, aes(age, fill = high_use)) +
  geom_bar(position = "fill") +
  ylab("relative frequency") +
  xlab("age") +
  scale_x_continuous(breaks = min(alcohol$age):max(alcohol$age)) +
  ggtitle("High alcohol consumption as a function of age", subtitle = "N = 382")
This shows us that my initial visual impression was inaccurate – the 19-year-olds are proportionately (slightly) heavier drinkers than the 17-year-olds. This data could hardly be said to support the hypothesis that heavy alcohol consumption becomes more frequent with age, however. Rather, what seems to happen is that it increases for a couple of years after the first taste, then stabilizes in the late teens.
We could now assess the statistical significance of these findings using the chi-squared test, but I’ll move on to the next variable instead in order to leave some analysis for logistic regression to do.
summary(alcohol$reason)
## course home other reputation
## 140 110 34 98
This variable has a more convenient frequency distribution of levels (different values that it can have) than the previous one: no single one of its groups is grossly underrepresented the way some of the age groups were. Now that we know that each group is sufficiently represented, let’s visually assess their respective rates of high alcohol use:
ggplot(data = alcohol, aes(reason, fill = high_use)) +
  geom_bar(position = "fill") +
  ylab("relative frequency") +
  xlab("Reason for having chosen one's current school") +
  ggtitle("Alcohol consumption and reason of school choice", subtitle = "N = 382")
This variable appears to exert a significant effect! Contrary to my hypothesis, however, it is not the “home” group that exhibits the highest rate of heavy alcohol consumption – it’s the “other” group, i.e., the wastebasket category reserved for the miscellanea of observations not falling into any of the three “main” categories of this variable. It is unfortunate that the researchers have chosen not to record the information on these “other” reasons for choosing a school – important information was lost with that decision. Other than that, the “home” group does indeed display an increased incidence of high use relative to the other explicitly defined groups.
Again, we’ll postpone further analysis to later stages.
table(alcohol$studytime)
##
## 1 1.5 2 2.5 3 4
## 100 4 187 4 60 27
Uh oh. Something goofy is afoot. What’s up with the levels “1.5” and “2.5”, and why are their frequencies so darn low? If we take a look at the metadata, we see that studytime was designed to be an ordinal variable with levels from 1 through 4. Then where have the intermediate levels 1.5 and 2.5 come from?
(Take a moment to think before scrolling on.)
They come from the Data Wrangling phase of this exercise, where the instructions were that whenever there were two observations of a given numeric variable (one from each of the two original datasets), we had to use the mean of these two values in the new combined dataset. In my estimation, this is not a viable approach for studytime because it is not a truly continuous variable. Its original levels were less than 2 hours, 2 to 5 hours, 5 to 10 hours and over 10 hours of weekly study. Since the levels are not evenly spaced, can we really calculate an average?
IMHO a better approach would have been to treat studytime as essentially categorical, using only the value from the first of the source datasets (student_mat) whenever the two datasets differed. This would be methodologically consistent with the treatment of the other categorical (non-numeric) variables in the dataset.
Unfortunately, there is no easy way to identify which rows of the source dataset student_mat correspond to these goofy values so that they could be used as replacements. This is because the rows in the two source datasets lack a unique identifier field (which IMHO is a methodological error). Even the combination of 13 columns that we used as student identifiers when merging the data frames is not reliable – there are rows with identical sequences of values in these columns. Therefore, to replace the goofy studytime values with the original integer ones from student_mat, I made a temporary adjustment to the code I had used to combine the two datasets in the previous section, simply adding one condition to the for-loop to give studytime the same treatment as the non-numeric variables:
if (startsWith(colnames(two.columns)[1],"studytime")) {AlcoholConsumption[variable] <- two.columns[,1]}
The resulting column uses the original integer values from student_mat. We can now add it to our dataset as a new column. The old studytime with the goofy decimal values will remain intact, however – we’re just bringing in a better alternative to it. This is a good rule of thumb when modifying existing variables – always keep the original data too.
alcohol$studytime.adjusted <- AlcoholConsumption$studytime
Let’s take a look at the contents of our new column:
table(alcohol$studytime.adjusted)
##
## 1 2 3 4
## 103 190 62 27
Much better. Now there are no ridiculously underrepresented groups. However, we should make another little adjustment in order to prevent R from treating this variable as continuous. This is easily accomplished by converting our new studytime.adjusted variable into a factor. A factor is a vector whose values are treated as categorical and can be named and ordered however we choose. In our case, we want to preserve the original ordering. However, we do want to change the names of the levels – the labels 1, 2, 3 and 4 are a bit misleading as they might lead us to assume that the levels are evenly spaced. We will replace these numeric labels with new ones that accurately reflect what the levels represent. These adjustments are easy to do in R:
alcohol$studytime.factor <- as.factor(alcohol$studytime.adjusted) #Create a version of studytime.adjusted that is categorical
levels(alcohol$studytime.factor) <- c("<2h","2-5h","5-10h",">10h") #Name its levels appropriately
Now the frequency table looks like this:
table(alcohol$studytime.factor)
##
## <2h 2-5h 5-10h >10h
## 103 190 62 27
With these conversions made, let’s take a look at how the data compares to our hypothesis that weekly study time is inversely correlated with risk of high alcohol use. Here’s the percent-stacked bar chart.
ggplot(data = alcohol, aes(studytime.factor, fill = high_use)) +
  geom_bar(position = "fill") +
  ylab("relative frequency") +
  xlab("weekly study time") +
  ggtitle("High alcohol consumption as a function of weekly study time", subtitle = "N = 382")
This data supports my fairly self-evident hypothesis that diligent students are less often heavy drinkers than lazy ones.
Alcohol lowers inhibitions and boosts libido. This is why it has been a powerful aid in human pair-bonding for aeons. In fact, this might just be the most important reason it is used in the first place, whether people admit it or not. Therefore, the assumption that people who are in a relationship, i.e. have already pair-bonded, have less reason to use alcohol was a fairly obvious one. As a binary variable, this one should be the easiest of the four to analyze.
Contingency tables (aka cross-tabulations) are a handy way of assessing correlations between variables that can be represented in 3x3 or smaller tables. We use them a lot in linguistics, e.g. when assessing whether the frequency distribution of a categorical variable such as that-omission vs. that-retention has remained constant or changed between two time periods. High alcohol consumption as a function of relationship status is a virtually identical phenomenon from a methodological standpoint. We’ll start by printing a contingency table of raw frequencies:
addmargins(table(alcohol$high_use, alcohol$romantic, dnn = c("high_use", "has_bf/gf")))
## has_bf/gf
## high_use no yes Sum
## FALSE 180 88 268
## TRUE 81 33 114
## Sum 261 121 382
Hmm…not nearly as much of a difference as I expected. For a clearer picture, a table of relative frequencies is useful:
round(prop.table(table(alcohol$high_use, alcohol$romantic, dnn=c("high_use","has_bf/gf")), 2), digits=2)
## has_bf/gf
## high_use no yes
## FALSE 0.69 0.73
## TRUE 0.31 0.27
Students with a romantic partner do seem less likely to be heavy drinkers, but the difference is only 4 percentage points in this dataset. Now I’m particularly tempted to test the significance of the effect with the chi-square test, but I’ll refrain. Logistic regression should shed more light on all that.
model<-glm(high_use ~ age + reason + studytime.factor + romantic, data = alcohol, family = "binomial") #Creates the model
logodds<-coef(model) #Extracts and stores the log-odds values produced by the model
ORs<-cbind(data.frame(Estimate = round(exp(logodds),digits=3)),round(exp(confint(model)),digits=3)) #Extracts and stores the odds ratios and their confidence intervals, rounded to 3 decimals, in an eye-friendly table
## Waiting for profiling to be done...
summary(model) #Prints out the results of the model
##
## Call:
## glm(formula = high_use ~ age + reason + studytime.factor + romantic,
## family = "binomial", data = alcohol)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5858 -0.8676 -0.6575 1.2070 2.3469
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.2950 1.6942 -2.535 0.011239 *
## age 0.2322 0.1015 2.286 0.022235 *
## reasonhome 0.2789 0.2844 0.981 0.326773
## reasonother 0.8068 0.4083 1.976 0.048138 *
## reasonreputation -0.2054 0.3236 -0.635 0.525604
## studytime.factor2-5h -0.4035 0.2652 -1.521 0.128178
## studytime.factor5-10h -1.4077 0.4216 -3.339 0.000842 ***
## studytime.factor>10h -1.1548 0.5962 -1.937 0.052769 .
## romanticyes -0.2624 0.2598 -1.010 0.312486
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 435.69 on 373 degrees of freedom
## AIC: 453.69
##
## Number of Fisher Scoring iterations: 4
As we anticipated at the beginning, higher age is shown to increase the probability of heavy alcohol use. The effect is significant at the .05 level. However, since only 4 different ages (15 through 18) are adequately represented, this is not much of a finding. It is to be expected that adults use more alcohol than kids. More interesting would be to look at the progression of people’s drinking habits from early adulthood to middle age and beyond, but that’s another topic.
My prediction on the effect of reason proves somewhat too bold. I assumed that people who choose a school near their homes are uninspired students and likely heavier-than-average drinkers. In our dataset, having chosen your school based on its convenient location increases the odds of being a heavy drinker by 32% (an odds ratio of 1.32). Here are the odds ratios and their confidence intervals:
ORs
## Estimate 2.5 % 97.5 %
## (Intercept) 0.014 0.000 0.364
## age 1.261 1.036 1.544
## reasonhome 1.322 0.756 2.312
## reasonother 2.241 1.002 5.008
## reasonreputation 0.814 0.427 1.526
## studytime.factor2-5h 0.668 0.397 1.125
## studytime.factor5-10h 0.245 0.102 0.539
## studytime.factor>10h 0.315 0.085 0.931
## romanticyes 0.769 0.458 1.272
Though a 32% increase in the odds seems non-negligible, the effect falls way short of statistical significance. It is not a matter of a tiny sample either, since there were 110 students in this group. I guess I’ll just say that some correlation was found but it wasn’t strong enough to make confident statements.
Something intriguing was discovered with this variable, however. The students whose reason for picking the school they did was classified as “other” appear to be heavier drinkers than those with other reasons, and the effect is statistically significant at the .05 level. It is most unfortunate that the researchers did not document the “other” reasons. We are now left to speculate on what those other reasons might be and why they correlate with high alcohol consumption. Here’s a hypothesis: most of the “other” reasons for choosing a school are that a particular girl/guy is attending it. And if one is willing to attend an inconveniently located school over a girl/guy, then it stands to reason that one is also willing to resort to other somewhat drastic measures – such as liquid courage – to get him/her.
I was mostly right about studytime’s correlation with drinking habits. Compared to the reference group (<2h of study a week), each of the more studious groups proves less likely to be heavy drinkers. The strange thing is that it is not the people who study the most who drink the least – the strongest disfavoring effect is found in those who study from 5 to 10 hours a week. Their odds of high alcohol use are a whopping 76% lower (an odds ratio of .24) than those of the laziest students, and the effect is significant at the highest level (p < .001). The most studious group also shows a strong disfavoring effect at -69%. Technically speaking, the effect falls slightly short of the critical value of statistical significance (its p-value is .052), but that can be attributed to the small sample size (only 27 students) of this group. Also note that the confidence interval of the group’s odds ratio does not include 1! This is perhaps more intuitively understood from the log-odds table, which shows that the entire confidence interval of studytime.factor>10h is in the negatives:
round((confint(model)),digits=3)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -7.668 -1.011
## age 0.035 0.434
## reasonhome -0.279 0.838
## reasonother 0.002 1.611
## reasonreputation -0.851 0.423
## studytime.factor2-5h -0.924 0.118
## studytime.factor5-10h -2.285 -0.617
## studytime.factor>10h -2.463 -0.072
## romanticyes -0.781 0.240
It is interesting, however, that it’s not the most studious group that drinks the least. We might hypothesize that those who study 5-10h a week live happy, balanced lives and do not need to drink themselves silly to relax and have fun, whereas those who study massive amounts need more extreme measures, such as intoxicants, in order to unwind.
Lastly, having a romantic partner proves statistically non-significant in predicting high alcohol use. The odds do decrease in a relationship, but only by 23%. The p-value is .31; the sample size is in the triple digits, i.e. large enough. I guess I was simply mistaken on this one.
We will now test the predictive power of our model. We’ll start by making a convenient “training copy” of our data, consisting of the 4 explanatory variables only. This way the real, observed outcomes are temporarily unknown and hidden, making the whole exercise more interesting!
alcohol.practice <- alcohol[c("age","reason","studytime.factor","romantic")]
Here’s a glimpse of the new data frame:
head(alcohol.practice)
## age reason studytime.factor romantic
## 1 15 home >10h no
## 2 15 reputation 2-5h yes
## 3 15 reputation <2h no
## 4 15 course 5-10h no
## 5 15 reputation 5-10h yes
## 6 15 course 5-10h no
The probability of the target outcome for a given data point (student) is calculated from the sum of the Intercept coefficient and the coefficients of the predictor values corresponding to that data point. This sum is the log-odds; exponentiating it gives the odds, which in turn are converted into a probability. We’ll illustrate this by arbitrarily selecting a student from our dataset. For example, let’s pick Student #100 and make a prediction on whether they’re a heavy drinker:
alcohol.practice[100,]
## age reason studytime.factor romantic
## 100 17 course 5-10h no
Let’s reprint the coefficients so we can work more easily:
as.data.frame(coef(model))
## coef(model)
## (Intercept) -4.2949807
## age 0.2321509
## reasonhome 0.2789380
## reasonother 0.8067722
## reasonreputation -0.2054305
## studytime.factor2-5h -0.4035197
## studytime.factor5-10h -1.4077006
## studytime.factor>10h -1.1547965
## romanticyes -0.2623731
So the natural logarithm of the odds of this person using a lot of alcohol is:
(log_of_odds <- -4.2949807 + 17*0.2321509 + 0 + -1.4077006 + 0) #Get the logarithm of odds
## [1] -1.756116
And the odds are:
(odds <- exp(log_of_odds)) #Get the odds
## [1] 0.1727144
And the probability is:
(probability <- odds/(1+odds)) #Get the probability
## [1] 0.1472775
That’s quite a bit of arithmetic to do, so fortunately there’s a predict() function that does all of it for us. We need only to specify the model, the data, and the type of output we want:
predict(model, alcohol.practice[100,], type = "response")
## 100
## 0.1472774
The slight discrepancy between the output of predict() and the manually calculated value arises because the former uses exact coefficient values, while my manual computation used the rounded values from the model printout.
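Incidentally, the two-step exp()/(1 + exp()) conversion above is exactly the inverse logit, which plogis() performs in one call:

```r
# Using the hand-computed log-odds from above
log_of_odds <- -1.756116
odds <- exp(log_of_odds)
c(manual = odds / (1 + odds), plogis = plogis(log_of_odds))  # same number twice
```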
Let’s now add a column to our practice data frame with the estimated probabilities of high alcohol use for every student, and look at a sample of the output:
alcohol.practice$probability <- predict(model, alcohol.practice, type="response")
alcohol.practice[seq(1,nrow(alcohol.practice),25),]
## age reason studytime.factor romantic probability
## 1 15 home >10h no 0.1559632
## 26 15 course <2h no 0.3073117
## 51 16 home <2h yes 0.3626220
## 76 16 home <2h no 0.4251593
## 101 17 reputation >10h no 0.1533398
## 126 17 reputation 5-10h no 0.1232997
## 151 18 course 2-5h no 0.3729025
## 176 15 reputation 2-5h yes 0.1565611
## 201 15 home 2-5h no 0.2814462
## 226 16 home <2h no 0.4251593
## 251 16 course <2h no 0.3588022
## 276 17 course 2-5h no 0.3203996
## 301 17 course <2h no 0.4137666
## 326 18 home <2h no 0.5405788
## 351 18 home 5-10h no 0.2235620
## 376 18 home 2-5h no 0.4400777
And as you already guessed, the binary prediction on each student is that if the estimated probability of high alcohol use is greater than 50%, we predict that s/he is a heavy drinker; otherwise we predict that s/he isn’t. We will now add a column with the final prediction for every student and look at the output:
alcohol.practice$prediction <- alcohol.practice$probability > 0.5
alcohol.practice[seq(1,nrow(alcohol.practice),25),]
## age reason studytime.factor romantic probability prediction
## 1 15 home >10h no 0.1559632 FALSE
## 26 15 course <2h no 0.3073117 FALSE
## 51 16 home <2h yes 0.3626220 FALSE
## 76 16 home <2h no 0.4251593 FALSE
## 101 17 reputation >10h no 0.1533398 FALSE
## 126 17 reputation 5-10h no 0.1232997 FALSE
## 151 18 course 2-5h no 0.3729025 FALSE
## 176 15 reputation 2-5h yes 0.1565611 FALSE
## 201 15 home 2-5h no 0.2814462 FALSE
## 226 16 home <2h no 0.4251593 FALSE
## 251 16 course <2h no 0.3588022 FALSE
## 276 17 course 2-5h no 0.3203996 FALSE
## 301 17 course <2h no 0.4137666 FALSE
## 326 18 home <2h no 0.5405788 TRUE
## 351 18 home 5-10h no 0.2235620 FALSE
## 376 18 home 2-5h no 0.4400777 FALSE
There! Now we have an estimated probability of high alcohol use for each student, plus a binary prediction based on that probability. We can finally unveil the real, actual outcomes observed in the dataset, and juxtapose them with the predicted outcomes for comparison:
alcohol.practice$real_outcome <- alcohol$high_use
alcohol.practice[seq(1,nrow(alcohol.practice),25),5:7]
## probability prediction real_outcome
## 1 0.1559632 FALSE FALSE
## 26 0.3073117 FALSE TRUE
## 51 0.3626220 FALSE FALSE
## 76 0.4251593 FALSE FALSE
## 101 0.1533398 FALSE FALSE
## 126 0.1232997 FALSE FALSE
## 151 0.3729025 FALSE FALSE
## 176 0.1565611 FALSE FALSE
## 201 0.2814462 FALSE TRUE
## 226 0.4251593 FALSE TRUE
## 251 0.3588022 FALSE FALSE
## 276 0.3203996 FALSE FALSE
## 301 0.4137666 FALSE FALSE
## 326 0.5405788 TRUE TRUE
## 351 0.2235620 FALSE FALSE
## 376 0.4400777 FALSE TRUE
We can see that our model doesn’t seem to be doing a great job predicting outcomes. But let’s calculate the exact degree of its fallibility. We’ll first create a column reporting whether the prediction failed:
alcohol.practice$prediction.failed <- alcohol.practice$prediction!=alcohol.practice$real_outcome
alcohol.practice[seq(1,nrow(alcohol.practice),25),5:8]
## probability prediction real_outcome prediction.failed
## 1 0.1559632 FALSE FALSE FALSE
## 26 0.3073117 FALSE TRUE TRUE
## 51 0.3626220 FALSE FALSE FALSE
## 76 0.4251593 FALSE FALSE FALSE
## 101 0.1533398 FALSE FALSE FALSE
## 126 0.1232997 FALSE FALSE FALSE
## 151 0.3729025 FALSE FALSE FALSE
## 176 0.1565611 FALSE FALSE FALSE
## 201 0.2814462 FALSE TRUE TRUE
## 226 0.4251593 FALSE TRUE TRUE
## 251 0.3588022 FALSE FALSE FALSE
## 276 0.3203996 FALSE FALSE FALSE
## 301 0.4137666 FALSE FALSE FALSE
## 326 0.5405788 TRUE TRUE FALSE
## 351 0.2235620 FALSE FALSE FALSE
## 376 0.4400777 FALSE TRUE TRUE
Crucially, R is so clever that it treats the logical values TRUE and FALSE numerically as 1 and 0, respectively. Thus calculating the error rate is easy as pie – it is simply the mean value of the column reporting TRUE for every student for whom the prediction failed.
(ErrorRate <- mean(alcohol.practice$prediction.failed))
## [1] 0.2879581
The model predicts correctly about 71% of the time.
Now we can compare the model’s predictive performance to a strategy of simple but educated guessing. First we need to know the overall rate of high alcohol use in the entire dataset, i.e., its empirical probability.
(EmpiricalProbability <- mean(alcohol$high_use))
## [1] 0.2984293
So it’s slightly below 30%. An educated guesser would take this empirical probability and toss a commensurately biased coin for each student in the dataset, resulting in a sequence of predictions like this:
(alcohol.practice$EducatedGuesses <- sample(c(TRUE, FALSE), size = nrow(alcohol), replace = TRUE, prob = c(EmpiricalProbability,1-EmpiricalProbability)))
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [12] FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [23] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [34] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [45] TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## [67] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [89] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [100] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
## [111] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
## [122] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [133] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [155] FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [166] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
## [177] TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [188] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
## [199] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## [210] TRUE TRUE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
## [221] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE
## [232] FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE
## [243] TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [254] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [265] TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
## [276] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [287] FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
## [298] TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
## [309] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## [320] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
## [331] TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [342] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [353] TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [364] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [375] FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE
alcohol.practice$EducatedGuessFailed <- alcohol.practice$real_outcome!=alcohol.practice$EducatedGuesses
Now then, we will simply cross-tabulate the outcome distributions of these two prediction strategies.
ModelPredictionFail <- table(alcohol.practice$prediction.failed)
EducatedGuessFail<- table(alcohol.practice$EducatedGuessFailed)
(CrossTab <- rbind(ModelPredictionFail, EducatedGuessFail))
## FALSE TRUE
## ModelPredictionFail 272 110
## EducatedGuessFail 207 175
The chi-square test is perfect for these situations. It measures the total, aggregate difference between the frequency distributions of two (or more) samples, and reports the probability that an equal or greater difference would occur if both samples were from the same population. This probability is the p-value. In our case, the “populations” are the hypothetical populations of successful and failed predictions generated by the two guessing strategies. If the prediction strategies are equally good, then it is unlikely that their observed success rates in our dataset of 382 students would differ enough to result in a p-value below .05. Let’s find out:
chisq.test(CrossTab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: CrossTab
## X-squared = 22.923, df = 1, p-value = 1.686e-06
My output reports a p-value far below the highest significance level of .001, so it seems safe to assume that the model soundly outperforms the educated guessing strategy. But bear in mind that the educated guesses are re-generated from the empirical probabilities every time this page is loaded, so your results will not be identical to mine! Should you somehow get lucky and chance upon an instance of the educated guessing strategy outperforming the model (or even being competitive enough to result in a statistically non-significant p-value), feel free to take a screenshot and email it to me at juho decimalpoint ruohonen at helsinki decimalpoint eff eye.
We’ll now go back to the full dataset with all the predictors intact – we will need them shortly. We’ll add a new column containing the probabilities calculated by our model, and specify an error rate calculation function ‘loss.func’ which will be needed in cross-validation.
alcohol$probability <- predict(model, alcohol, type="response")
loss.func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when the predicted probability falls on the wrong side of 0.5
  mean(n_wrong)                       # proportion of wrong predictions
}
loss.func(alcohol$high_use,alcohol$probability)
## [1] 0.2879581
Then we’ll perform a 10-fold cross-validation of the model. This means that we divide the dataset into 10 random, complementary subsets of equal size, then proceed as follows:

1. Fit the model coefficients on sub-datasets 1 through 9, test the model’s predictive performance on sub-dataset 10, and store the error rate in memory.
2. Re-fit the model coefficients on sub-datasets 2 through 10, then test the model on sub-dataset 1 and store the error rate.
3. Re-fit on sub-datasets 3 through 10 plus 1, test on sub-dataset 2, and so on, until every sub-dataset has been used for both model fitting and model testing exactly once.
4. Calculate the mean error rate from the individual error rates of the 10 cycles.
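To make the procedure concrete, here is a sketch of what such a cross-validation does under the hood, written as an explicit loop on simulated toy data (the variable names are hypothetical, purely for illustration):

```r
# Sketch of 10-fold cross-validation as an explicit loop.
# Toy simulated data; not the alcohol dataset.
set.seed(2)
n   <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + toy$x))

K     <- 10
folds <- sample(rep(1:K, length.out = n))   # random assignment to 10 folds
errs  <- numeric(K)

for (k in 1:K) {
  train <- toy[folds != k, ]                # fit on 9 folds...
  test  <- toy[folds == k, ]                # ...test on the held-out fold
  fit   <- glm(y ~ x, data = train, family = "binomial")
  p     <- predict(fit, test, type = "response")
  errs[k] <- mean((p > 0.5) != test$y)      # error rate on the held-out fold
}

mean(errs)  # the cross-validated error rate
```

Every observation gets predicted exactly once by a model that never saw it during fitting, which is what makes the resulting error rate an honest estimate.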
All this can be done with a single function call if we have the boot library.
library(boot)
CrossVal <- cv.glm(data = alcohol, glmfit=model, cost = loss.func, K = 10)
CrossVal$delta[[1]]
## [1] 0.2984293
Now the error rate is of course higher than when we were simply testing the model’s predictive power on the exact same dataset on which it was built.
This model performs worse than the one based on failures, absences and sex that we used in the Datacamp exercises. However, I chose these variables for autodidactic reasons (to teach myself to work with both continuous and categorical ones) rather than to maximize prediction performance. I predict, however, that it shall be easy to come up with a 4-variable model with an error rate below 25%. Actually, that goal is too modest. Let’s shoot for a 3-variable model with an error rate lower than 25%.
To make the model-building and experimentation quicker and more convenient, we’ll write a function that minimizes the typing and manual calculation involved:
logregr <- function(dataset = alcohol, model.formula){
  tmp.df <- dataset
  attach(tmp.df, warn.conflicts = FALSE)  # make the columns visible by name
  logitmodel <- glm(data = tmp.df, formula = model.formula, family = "binomial")
  print(summary(logitmodel))
  cat("Logodds coefficients and their confidence intervals:\n")
  print(cbind(data.frame(Estimate = coef(logitmodel)), confint(logitmodel)))
  cat("\nError rate in 10-fold cross-validation:\n",
      cv.glm(data = tmp.df, glmfit = logitmodel, cost = loss.func, K = 10)[[3]][1])
  detach(tmp.df)
}
Now we only need to specify the dataset and the formula stating the response variable and the predictors. The function will automatically print out the model summary, a table of the logodds-coefficients with their confidence intervals (which I find more intuitive than the odds ratios), and finally the error rate obtained in a cycle of 10-fold cross-validation.
So, let’s try to find a model that uses only 3 predictors and achieves an error rate lower than 25%.
logregr(alcohol, high_use ~ sex+famrel+studytime.factor)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5113 -0.8705 -0.6293 1.2387 2.1741
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3277 0.5375 0.610 0.54203
## sexM 0.7458 0.2500 2.984 0.00285 **
## famrel -0.3160 0.1267 -2.495 0.01260 *
## studytime.factor2-5h -0.2685 0.2688 -0.999 0.31798
## studytime.factor5-10h -1.0123 0.4340 -2.332 0.01969 *
## studytime.factor>10h -1.2909 0.5993 -2.154 0.03124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 434.07 on 376 degrees of freedom
## AIC: 446.07
##
## Number of Fisher Scoring iterations: 4
##
## Logodds coefficients and their confidence intervals:
## Waiting for profiling to be done...
## Estimate 2.5 % 97.5 %
## (Intercept) 0.3277477 -0.7299430 1.38518832
## sexM 0.7458322 0.2594951 1.24132207
## famrel -0.3160191 -0.5665000 -0.06815506
## studytime.factor2-5h -0.2684703 -0.7942086 0.26156699
## studytime.factor5-10h -1.0122745 -1.9090880 -0.19217144
## studytime.factor>10h -1.2908808 -2.6060013 -0.20350510
##
## Error rate in 10-fold cross-validation:
## 0.3062827
logregr(alcohol, high_use ~ sex+famrel+Pstatus)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4113 -0.8667 -0.6679 1.2311 1.9401
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.02104 0.58866 0.036 0.97149
## sexM 0.93116 0.23452 3.971 7.17e-05 ***
## famrel -0.33016 0.12446 -2.653 0.00799 **
## PstatusT -0.08714 0.38065 -0.229 0.81894
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 443.77 on 378 degrees of freedom
## AIC: 451.77
##
## Number of Fisher Scoring iterations: 4
##
## Logodds coefficients and their confidence intervals:
## Waiting for profiling to be done...
## Estimate 2.5 % 97.5 %
## (Intercept) 0.02103570 -1.1487397 1.17072656
## sexM 0.93115696 0.4765509 1.39750348
## famrel -0.33015614 -0.5765789 -0.08690103
## PstatusT -0.08713609 -0.8145044 0.68972175
##
## Error rate in 10-fold cross-validation:
## 0.2879581
Sigh. I wanted to find the better model without resorting too much to the silly variables used in the Datacamp example – that model predicts high alcohol use from school absenteeism and failures. Common sense suggests that high use of alcohol is a cause, not a result, of these phenomena. But in order to be able to move on, I’ll just use sex and absences as the first two variables, and try to find a third one that improves model performance relative to failures.
logregr(alcohol, high_use ~ sex+absences+goout)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7897 -0.8155 -0.5441 0.8346 2.4743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16640 0.47692 -8.736 < 2e-16 ***
## sexM 0.96586 0.25451 3.795 0.000148 ***
## absences 0.08495 0.02239 3.794 0.000148 ***
## goout 0.72847 0.11992 6.074 1.24e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 388.10 on 378 degrees of freedom
## AIC: 396.1
##
## Number of Fisher Scoring iterations: 4
##
## Logodds coefficients and their confidence intervals:
## Waiting for profiling to be done...
## Estimate 2.5 % 97.5 %
## (Intercept) -4.1664029 -5.14237353 -3.2687073
## sexM 0.9658600 0.47308974 1.4730023
## absences 0.0849536 0.04234994 0.1312952
## goout 0.7284671 0.49910441 0.9703140
##
## Error rate in 10-fold cross-validation:
## 0.2198953
We have a winner! Maleness, school absenteeism, and going out with friends are all significant correlates – though not necessarily causes – of high alcohol use in 15-to-22-year-old Portuguese students.
Though computers perform immensely complex calculations for us, they cannot interpret the results or draw conclusions. Interpretation is always the task of the researcher, who must rely on his/her knowledge of the subject matter to “connect the dots” emerging from the calculations. In a similar vein, even a high coefficient obtained by a model is in itself no evidence whatsoever of a causal link between variables x and y.
In all variants of regression analysis, the computer is instructed to “explain” the variance of the y-variable as a function of the x-variable(s) and a constant, and it will make its best effort to do just that – regardless of how absurd that “explanation” may be. A prime example of this is the association between ice-cream consumption and drownings. Correlation is not causation. Therefore, we need to really know what we’re doing if we’re going to dump truckloads of predictors into the model – the more we dump in, the better the computer will be able to “explain” the variation. But unless we can plausibly explicate HOW a given x-variable with a significant p-value causes the change in the y-variable, we cannot claim a causal link.
Therefore, aside from prediction power, model economy is another key goal of regression analysis. We aim to explain as much of the variation of Y as possible with as few x-variables as possible, because every new x-variable we add to the model increases the amount of explaining we have to do – we want our explanation of a given phenomenon to be as succinct as possible.
The AIC score provides a measure of this. It is a function of residual deviance and model complexity: it decreases as model fit improves, and increases with every parameter added. The lower the AIC score, the more accurate and/or economical the model.
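Concretely, for a binary logistic model the score works out to AIC = residual deviance + 2 × (number of estimated parameters), which is easy to verify on a toy model (simulated data, purely illustrative):

```r
# Verify AIC = residual deviance + 2 * number of parameters
# for a binary logistic glm. Simulated toy data.
set.seed(3)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(toy$x))

m <- glm(y ~ x, data = toy, family = "binomial")

AIC(m)                              # R's built-in score
deviance(m) + 2 * length(coef(m))   # the same number, assembled by hand
```

The 2 × parameters term is exactly the complexity penalty: adding a useless predictor leaves the deviance roughly unchanged but still raises the AIC by about 2.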
With these guiding principles in mind, we will now practise manual step-down regression (also known as backwards elimination). We start with a bloated model featuring 10 predictors, and we will try to iteratively trim it down to just the most essential ones.
In order to make the printouts more succinct and less cluttered, I’ll use the original (numeric) version of studytime despite the questionability of its continuous nature. I will also leave absences out of the equation despite its strong correlation with the response variable: I remain unconvinced that it causes high alcohol use rather than being caused by high alcohol use. The 10 predictors that we begin with, then, appear in the Round 1 function call below.
Before beginning, we will fine-tune our logregr() function to suit the task better. Rather than 10-fold cross-validation, we’ll tell it to use the more exhaustive and reliable “leave-one-out” technique, which divides the data into as many partitions as there are observations. This means that R will rebuild the model on 381 of the total 382 observations and then try to predict the outcome of the one that was left out – 382 times – and calculate the mean error rate. The procedure is CPU-intensive – don’t try it on an 8086! The leave-one-out method is specified simply by omitting the K argument of the cv.glm() call. Secondly, we’ll tell logregr() to skip the printing of confidence intervals – the model summary should suffice (note how easy it is to “deactivate” lines of code while still keeping them available for later use, simply by prepending # to them). Lastly, let’s have the function output both the training error and the testing error in tabular format, so that we can easily document our model-trimming process.
logregr <- function(dataset = alcohol, model.formula){
  tmp.df <- dataset
  attach(tmp.df, warn.conflicts = FALSE)
  logitmodel <- glm(data = tmp.df, formula = model.formula, family = "binomial")
  print(summary(logitmodel))
  #cat("Logodds coefficients and their confidence intervals:\n"); print(cbind(data.frame(Estimate=coef(logitmodel)), confint(logitmodel)))
  tmp.df$probability <- predict(logitmodel, tmp.df, type = "response")
  ErrorRates <- data.frame(TrainingError = loss.func(tmp.df$high_use, tmp.df$probability),
                           TestingError = cv.glm(data = tmp.df, glmfit = logitmodel, cost = loss.func)[[3]][1])
  detach(tmp.df)
  cat("\nError rates:\n")
  print.data.frame(ErrorRates)
  return(ErrorRates)
}
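For the curious, leave-one-out cross-validation is just the K = n special case: each “fold” is a single observation. It can be sketched as an explicit loop on simulated toy data (hypothetical names, purely illustrative):

```r
# Leave-one-out CV as an explicit loop: refit n times, each time
# predicting the single held-out observation. Toy simulated data.
set.seed(4)
n   <- 60
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(toy$x))

wrong <- logical(n)
for (i in 1:n) {
  fit <- glm(y ~ x, data = toy[-i, ], family = "binomial")  # fit without row i
  p   <- predict(fit, toy[i, ], type = "response")          # predict row i only
  wrong[i] <- (p > 0.5) != toy$y[i]
}

mean(wrong)  # leave-one-out error rate
```

Because every refit uses all but one observation, the procedure is deterministic (no random fold assignment), at the cost of n model fits.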
Alright, let’s start culling our oversized flock of variables.
Round1 <- logregr(alcohol, high_use ~ sex+goout+failures+internet+Pstatus+studytime+address+G3+reason+activities)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9951 -0.7682 -0.4803 0.7653 2.5695
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.590348 0.924000 -2.803 0.00506 **
## sexM 0.683332 0.273356 2.500 0.01243 *
## goout 0.783408 0.124819 6.276 3.47e-10 ***
## failures 0.306092 0.262269 1.167 0.24317
## internetyes 0.334722 0.391270 0.855 0.39229
## PstatusT -0.193650 0.430505 -0.450 0.65284
## studytime -0.374028 0.178834 -2.091 0.03648 *
## addressU -0.874049 0.329809 -2.650 0.00805 **
## G3 0.001252 0.044276 0.028 0.97744
## reasonhome 0.477262 0.317827 1.502 0.13319
## reasonother 0.970762 0.449768 2.158 0.03090 *
## reasonreputation -0.049525 0.353233 -0.140 0.88850
## activitiesyes -0.396111 0.267416 -1.481 0.13854
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 376.70 on 369 degrees of freedom
## AIC: 402.7
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2172775 0.2356021
Grades in Math and Portuguese are clearly irrelevant. Out they go.
Round2 <- logregr(alcohol, high_use ~ sex+goout+failures+internet+Pstatus+studytime+address+reason+activities)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9976 -0.7691 -0.4802 0.7640 2.5717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.57628 0.77844 -3.310 0.000934 ***
## sexM 0.68366 0.27308 2.504 0.012295 *
## goout 0.78296 0.12381 6.324 2.55e-10 ***
## failures 0.30335 0.24370 1.245 0.213222
## internetyes 0.33537 0.39061 0.859 0.390572
## PstatusT -0.19497 0.42790 -0.456 0.648644
## studytime -0.37348 0.17775 -2.101 0.035629 *
## addressU -0.87235 0.32426 -2.690 0.007139 **
## reasonhome 0.47726 0.31782 1.502 0.133180
## reasonother 0.97080 0.44985 2.158 0.030925 *
## reasonreputation -0.04928 0.35311 -0.140 0.889016
## activitiesyes -0.39558 0.26674 -1.483 0.138076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 376.70 on 370 degrees of freedom
## AIC: 400.7
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2198953 0.2329843
AIC score went down a bit. Good. I find myself eyeballing the z-value column the most. Its absolute value is often a direct measure of the significance of a variable, and unlike the p-value, significance grows as the distance from zero grows. One important thing to note, however, is the different treatment of continuous and categorical variables. With multi-class categorical variables, every level except the first one (the reference level) is listed as a “variable” of its own, with its own coefficient and other statistics. This means that we cannot extract the significance of reason directly from this printout. Instead, we must consider the range of z-value variation exhibited by all the levels as a whole. In this example, reasonreputation has a z-value of -0.140 while reasonother has 2.158. The total z-value spread of this variable, then, is about 2.3, indicating that this is quite a significant variable indeed.
The continuous variables are easier to interpret: the coefficient represents the effect of a one-unit change in the variable’s value, and the z-value represents the overall significance of the variable when all of its variation is taken into account.
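As an aside, a more standard way to get a single significance figure for a whole multi-level factor is a likelihood-ratio test, e.g. via drop1(model, test = "Chisq"), which compares the model with and without the entire variable at once. A minimal sketch on simulated data (hypothetical names, not the alcohol dataset):

```r
# Whole-variable significance for a multi-level factor via a
# likelihood-ratio test. Simulated toy data for illustration.
set.seed(5)
toy <- data.frame(
  g = factor(sample(c("a", "b", "c"), 150, replace = TRUE)),
  x = rnorm(150)
)
toy$y <- rbinom(150, 1, plogis(0.5 * toy$x + ifelse(toy$g == "c", 1, 0)))

m <- glm(y ~ g + x, data = toy, family = "binomial")

# One p-value per *variable*: all levels of the factor g are tested jointly
drop1(m, test = "Chisq")
```

This sidesteps the level-by-level printout problem entirely: the factor contributes a single row with a single p-value, regardless of how many levels it has.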
The next variable to go is PstatusT. I’m a little surprised that the cohabitation status of the parents is showing so little effect. I assumed that children of intact nuclear families would drink considerably less than those of single parents. Guess not! Next round:
Round3 <- logregr(alcohol, high_use ~ sex+goout+failures+internet+studytime+address+reason+activities)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0061 -0.7694 -0.4835 0.7610 2.5605
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7428 0.6894 -3.978 6.94e-05 ***
## sexM 0.6813 0.2731 2.495 0.01260 *
## goout 0.7829 0.1238 6.323 2.57e-10 ***
## failures 0.2981 0.2435 1.224 0.22084
## internetyes 0.3265 0.3901 0.837 0.40265
## studytime -0.3756 0.1775 -2.115 0.03440 *
## addressU -0.8652 0.3240 -2.670 0.00758 **
## reasonhome 0.4826 0.3176 1.520 0.12860
## reasonother 0.9725 0.4490 2.166 0.03031 *
## reasonreputation -0.0427 0.3529 -0.121 0.90371
## activitiesyes -0.4036 0.2661 -1.517 0.12939
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 376.91 on 371 degrees of freedom
## AIC: 398.91
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2172775 0.2329843
AIC improved again.
I suppose internet is the next one to go. Its effect is totally contrary to my presumption. I assumed that people with no internet would drink more because they have fewer other entertainment options available. Not so. Live and learn…
Round4 <- logregr(alcohol, high_use ~ sex+goout+failures+studytime+address+reason+activities)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9874 -0.7710 -0.4786 0.7802 2.4739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.56314 0.65082 -3.938 8.20e-05 ***
## sexM 0.68918 0.27284 2.526 0.0115 *
## goout 0.78937 0.12362 6.385 1.71e-10 ***
## failures 0.28263 0.24232 1.166 0.2435
## studytime -0.37424 0.17751 -2.108 0.0350 *
## addressU -0.80010 0.31429 -2.546 0.0109 *
## reasonhome 0.50294 0.31649 1.589 0.1120
## reasonother 0.98488 0.44984 2.189 0.0286 *
## reasonreputation -0.01745 0.35117 -0.050 0.9604
## activitiesyes -0.38086 0.26462 -1.439 0.1501
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 377.63 on 372 degrees of freedom
## AIC: 397.63
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2198953 0.2251309
AIC keeps modestly improving.
I think failures goes next. Not gonna miss it. It was another of those whose very inclusion as a predictor seemed slightly dubious due to uncertainty about the directionality of the link, i.e., which really causes which. Begone, foul variable!
Round5 <- logregr(alcohol, high_use ~ sex+goout+studytime+address+reason+activities)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8740 -0.7905 -0.4804 0.7770 2.5924
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.45141 0.64179 -3.820 0.000134 ***
## sexM 0.70513 0.27185 2.594 0.009493 **
## goout 0.80654 0.12321 6.546 5.91e-11 ***
## studytime -0.41612 0.17424 -2.388 0.016928 *
## addressU -0.82719 0.31279 -2.645 0.008179 **
## reasonhome 0.51028 0.31477 1.621 0.104994
## reasonother 0.95748 0.44811 2.137 0.032621 *
## reasonreputation -0.03361 0.35053 -0.096 0.923606
## activitiesyes -0.39805 0.26353 -1.510 0.130924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 379.00 on 373 degrees of freedom
## AIC: 397
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2198953 0.2329843
Now things are getting harder. The p-value is just a rough guideline that one scientist made up in the 1930s when it was hard work to perform all these calculations manually over and over again. It is not an infallible litmus test of significance. All of the remaining variables seem at least somewhat relevant to me. It is therefore with a heavy heart that I remove activities from the list of predictors – I only do so to comply with the exercise instructions.
Round6 <- logregr(alcohol, high_use ~ sex+goout+studytime+address+reason)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9306 -0.7778 -0.4823 0.8296 2.5266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4862 0.6390 -3.891 9.98e-05 ***
## sexM 0.6266 0.2656 2.359 0.01830 *
## goout 0.7897 0.1217 6.491 8.55e-11 ***
## studytime -0.4513 0.1734 -2.603 0.00924 **
## addressU -0.8104 0.3120 -2.598 0.00939 **
## reasonhome 0.5091 0.3125 1.629 0.10334
## reasonother 0.9321 0.4493 2.075 0.03802 *
## reasonreputation -0.0993 0.3474 -0.286 0.77500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 381.30 on 374 degrees of freedom
## AIC: 397.3
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2198953 0.2303665
We are essentially finished now. At this stage, it is no longer clear whether the current model is better than the previous one that included activities. The AIC score actually got worse after removing it, supporting my impression that it was no longer wise to keep eliminating predictors. On the other hand, Testing Error improved slightly. Frankly, I’m having a hard time wrapping my head around these numbers: both AIC and Residual Deviance increased, and both those statistics are supposedly derived (at least in part) from how well the model fits the data. Then how can prediction accuracy improve at the same time as those two measures become worse?
In any case, all the remaining predictors are significant from this point on – according to both statistical math and plain common sense. If we remove more, we are bound to witness an upswing in both error rate and AIC score:
Round7 <- logregr(alcohol, high_use ~ sex+goout+studytime+address)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7689 -0.8106 -0.4822 0.7589 2.4343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1931 0.6025 -3.640 0.000273 ***
## sexM 0.6525 0.2617 2.493 0.012652 *
## goout 0.7727 0.1203 6.421 1.36e-10 ***
## studytime -0.4965 0.1691 -2.936 0.003329 **
## addressU -0.7287 0.3009 -2.421 0.015467 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 388.57 on 377 degrees of freedom
## AIC: 398.57
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2277487 0.2277487
Say what? AIC and Residual Deviance deteriorated further. Even Training Error went up, all of which was to be expected from removing a clearly significant predictor. Yet Testing Error improved again???? I am stumped. I can’t even. Basic common sense would suggest, however, that the more accurately a model predicts outcomes and the simpler it is, the better it is, no matter what any other fancy mathematical calculations might say. Therefore, we will keep playing this game for as long as Testing Error keeps improving. The next variable to go is address.
Round8 <- logregr(alcohol, high_use ~ sex+goout+studytime)
##
## Call:
## glm(formula = model.formula, family = "binomial", data = tmp.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7173 -0.8038 -0.5212 0.8877 2.6290
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.7121 0.5691 -4.766 1.88e-06 ***
## sexM 0.6725 0.2585 2.602 0.00927 **
## goout 0.7482 0.1188 6.298 3.01e-10 ***
## studytime -0.4866 0.1679 -2.898 0.00375 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 394.38 on 378 degrees of freedom
## AIC: 402.38
##
## Number of Fisher Scoring iterations: 4
##
##
## Error rates:
## TrainingError TestingError
## 1 0.2460733 0.2460733
WHOA stop right there! Enough! Now EVERYTHING got worse. That was clearly one step too far. We are finished. Let’s build a data frame reporting all the training and testing errors that we observed in each phase of this process, and then draw a chart of their progression:
(Errors<-cbind(PredictorsRemaining=10:3,rbind(Round1,Round2,Round3,Round4,Round5,Round6,Round7,Round8)))
## PredictorsRemaining TrainingError TestingError
## 1 10 0.2172775 0.2356021
## 2 9 0.2198953 0.2329843
## 3 8 0.2172775 0.2329843
## 4 7 0.2198953 0.2251309
## 5 6 0.2198953 0.2329843
## 6 5 0.2198953 0.2303665
## 7 4 0.2277487 0.2277487
## 8 3 0.2460733 0.2460733
ggplot(data=Errors, aes(x=PredictorsRemaining)) +
  geom_line(aes(y=TrainingError, x=PredictorsRemaining-0.017, color="TrainingError")) +
  geom_line(aes(y=TestingError, x=PredictorsRemaining+0.017, color="TestingError")) +
  geom_point(aes(y=TestingError, x=PredictorsRemaining+0.017)) +
  geom_point(aes(y=TrainingError, x=PredictorsRemaining-0.017)) +
  scale_x_reverse(breaks=10:3) +
  scale_y_continuous(limits = c(.20,.26), breaks = c(.20,.21,.22,.23,.24,.25,.26)) +
  theme(panel.grid.minor=element_blank()) +
  ylab("Error Rate") + xlab("Number of Predictors in Model") +
  ggtitle("Error Rate Progression in Backward Elimination of Predictors", subtitle="N = 382")
Going by the simple and intuitive criteria of model economy and prediction power, there is no question about the winner. The model in Round 7, containing the predictors goout, sex, address and studytime, achieves by far the best combination of simplicity and accuracy. This partly contradicts the AIC and Residual Deviance statistics, both of which implied the superiority of Model 5 (which also included reason and activities). But I will go with common sense here, awarding the win to Model 7. Here’s the winning function call:
logregr(alcohol, high_use ~ sex+goout+studytime+address)
This relatively laborious backward-elimination process that we’ve just completed can be automated by many software tools. One of them is Rbrul(), a popular R program with an interactive interface that supports complex methods such as generalized mixed modeling. However, it is risky to rely on automated algorithms to identify significant variables. Computers can calculate correlations, but they can’t detect causal relationships, and they can’t decide which variables are meaningful to start the analysis with. A computer couldn’t have made my choice – based on real-world knowledge – to leave absences out of the initial set of variables due to its dubious causal status. At the time of this writing (2017), computers are still idiot savants – great at computation, but ultimately dumb.
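For what it's worth, base R itself ships with one such tool: step() automates backward (or forward) elimination using AIC as the criterion. A minimal sketch on the built-in mtcars data (a stand-in, since our alcohol data isn't loaded here), with the usual caveat that step() optimizes AIC blindly, without any regard for causal sense:

```r
# Automated backward elimination by AIC with base R's step().
# vs (engine shape) serves as a hypothetical binary outcome.
full <- glm(vs ~ mpg + wt + hp + qsec, family = "binomial", data = mtcars)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)  # the predictor subset that step() settles on
```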
This is the Clustering and Classification homework. We’ll start by taking a gander at the data.
library(MASS)
data("Boston")
str(Boston);View(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
These variable names tell us little about what they represent, so I’ll just copy-paste the descriptions from the original website:
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
Not even the host page tells us much about the purpose of this dataset. If I had to guess, I’d guess it was collected to analyze correlates and predictors of house prices. Here are the summaries:
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
A note on assumptions is in order here. Despite the name, the “linear” in Linear Discriminant Analysis does not impose any linearity requirement on the predictor variables themselves: it refers to the discriminants being linear combinations of the predictors. What LDA does require is that the explanatory variables be represented numerically – which is doubtless why this dataset was chosen for this chapter, since it contains only numeric variables (plus one binary one, chas, which a 0/1 coding handles fine). Formally, LDA also assumes that the predictors are normally distributed within each group and share the same variance across groups. However, according to this page, there are valid work-arounds even when these requirements are not met, which makes LDA quite a versatile statistical tool indeed.
We will now proceed to draw a plot of the correlations of each pair of variables:
library("corrplot");library("tidyverse")
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
## select(): dplyr, MASS
(matriisi<-cor(Boston)) %>% corrplot(method="circle")
Circle diameter and darkness illustrate the strength of each variable’s correlation with the variable that intersects with it in the matrix; blue means a positive correlation, red a negative one. The strongest correlations that catch my eye in this plot are the positive one between rad and tax and the negative ones between dis and each of nox, age and indus.
Now it’s time to scale the data.
Boston.unscaled<-Boston #Backing up the unscaled version
Boston<-scale(Boston)
summary(Boston)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
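The means are now all exactly zero, as expected. For the record, here is a self-contained check of what scale() did to each column, using the age variable as an example (a fresh, unscaled copy of the data is loaded under a new name so as not to disturb our working Boston):

```r
# scale() computes the z-score (x - mean(x)) / sd(x) column by column.
fresh <- MASS::Boston  # fresh unscaled copy, separate from our scaled Boston
z <- (fresh$age - mean(fresh$age)) / sd(fresh$age)
all.equal(z, as.numeric(scale(fresh$age)))  # TRUE
```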
Now every variable has been standardized: centered by subtracting the column mean from each value, and then divided by the standard deviation of that column. This makes comparisons of variance much easier and more intuitive. Next we convert crim(e rate) into a categorical variable, AKA a factor. This time we also do something that is seldom advisable, i.e., delete the old, numeric variable afterwards:
Boston<-as.data.frame(Boston)
(kvantiilit<-quantile(Boston$crim))
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
Boston$crime.factor<-cut(Boston$crim, kvantiilit, include.lowest = TRUE, labels=c("low","low-ish","high-ish","high"))
Boston<-Boston[,-1]
The scale() function turned Boston into a matrix object, which doesn’t behave quite the way we want here, so I had to convert it back into a data frame before being able to proceed. (The reason, if you must know: scale() operates on matrices internally, so a matrix is what it returns.)
Now we must randomly split the data into a training set consisting of 80% of the observations and a testing set comprising the remaining 20%:
set.seed(10)
TrainingRows<-sample(nrow(Boston),size=nrow(Boston)*.8) #Randomly select 80% of all rows
TrainingSet<-Boston[TrainingRows,] #Create training set
TestingSet<-Boston[-TrainingRows,] #Create testing set
head(TrainingSet)
## zn indus chas nox rm age
## 257 3.3717021 -1.07673449 -0.2723291 -1.38676461 1.6642999 -1.2211827
## 155 -0.4872402 1.23072696 3.6647712 2.72964520 -0.2215067 0.9742880
## 216 -0.4872402 -0.07970124 -0.2723291 -0.56693456 -0.1460744 -0.9298742
## 349 2.9429307 -1.33036576 -0.2723291 -1.03294322 0.4986579 -1.3810470
## 43 -0.4872402 -0.61611679 -0.2723291 -0.92075595 -0.1645767 -2.2016841
## 113 -0.4872402 -0.16424500 -0.2723291 -0.06640675 -0.5289287 0.8641592
## dis rad tax ptratio black lstat
## 257 1.20674602 -0.7521778 -0.97448656 -1.18041474 0.3249467 -1.3363648
## 155 -0.97147402 -0.5224844 -0.03107419 -1.73470120 -0.3905371 0.3454580
## 216 0.07140456 -0.6373311 -0.77868399 0.06672981 0.4047979 -0.4457409
## 349 2.16029607 -0.6373311 -0.76088376 -0.67231881 0.3753329 -0.9330634
## 43 0.91458805 -0.7521778 -1.03975408 -0.25660396 0.2924148 -0.9582698
## 113 -0.68463492 -0.4076377 0.14099473 -0.30279450 0.4192565 0.4980964
## medv crime.factor
## 257 2.3341253 low
## 155 -0.6015814 high-ish
## 216 0.2682577 low-ish
## 349 0.2138927 low
## 43 0.3008766 low-ish
## 113 -0.4058676 low-ish
head(TestingSet)
## zn indus chas nox rm age
## 1 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629 -0.1198948
## 3 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456 -0.2655490
## 4 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978 -0.8090878
## 5 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620 -0.5106743
## 14 -0.4872402 -0.4368257 -0.2723291 -0.1440749 -0.4776917 -0.2406812
## 16 -0.4872402 -0.4368257 -0.2723291 -0.1440749 -0.6413655 -0.4289659
## dis rad tax ptratio black lstat
## 1 0.1400750 -0.9818712 -0.6659492 -1.4575580 0.4406159 -1.0744990
## 3 0.5566090 -0.8670245 -0.9863534 -0.3027945 0.3960351 -1.2075324
## 4 1.0766711 -0.7521778 -1.1050216 0.1129203 0.4157514 -1.3601708
## 5 1.0766711 -0.7521778 -1.1050216 0.1129203 0.4406159 -1.0254866
## 14 0.4333252 -0.6373311 -0.6006817 1.1753027 0.4406159 -0.6151835
## 16 0.3341188 -0.6373311 -0.6006817 1.1753027 0.4265954 -0.5857761
## medv crime.factor
## 1 0.1595278 low
## 3 1.3229375 low
## 4 1.1815886 low
## 5 1.4860323 low
## 14 -0.2318998 high-ish
## 16 -0.2862647 high-ish
So now we will follow a procedure very similar in spirit to the cross-validation seen in the previous chapter: first we build a model on the training dataset, then we test its predictive power on the testing dataset. A new, useful detail is that if you want to use all the other variables in the data frame as predictors, you can write a single dot as a wildcard for them:
(lda.fit<-lda(crime.factor ~ ., data = TrainingSet))
## Call:
## lda(crime.factor ~ ., data = TrainingSet)
##
## Prior probabilities of groups:
## low low-ish high-ish high
## 0.2376238 0.2524752 0.2475248 0.2623762
##
## Group means:
## zn indus chas nox rm
## low 0.9594165 -0.8884546 -0.14929469 -0.8605362 0.40702138
## low-ish -0.1166892 -0.3020359 0.03646311 -0.5618582 -0.09366532
## high-ish -0.3911954 0.2128639 0.20012296 0.3838094 0.08903229
## high -0.4872402 1.0149946 -0.08661679 1.0257502 -0.39221614
## age dis rad tax ptratio
## low -0.8694440 0.8295029 -0.6863802 -0.7605747 -0.46686714
## low-ish -0.3840715 0.3607010 -0.5528850 -0.4794539 -0.07048326
## high-ish 0.4321347 -0.4042103 -0.3892622 -0.2854989 -0.29725164
## high 0.8311809 -0.8556179 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3754402 -0.75841326 0.50644405
## low-ish 0.3219624 -0.16902036 0.02254946
## high-ish 0.1455522 0.06834233 0.18486185
## high -0.8015896 0.90526853 -0.73923748
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.134711446 0.815076287 -1.02611246
## indus 0.033748436 -0.336884258 0.15341827
## chas -0.003030025 -0.059884611 0.11849430
## nox 0.267076396 -0.592218405 -1.25727830
## rm 0.055354251 -0.106946680 -0.07288027
## age 0.310296486 -0.303533896 -0.23368301
## dis -0.148881410 -0.295690803 0.24054891
## rad 3.460334684 0.875004531 -0.16525987
## tax 0.009992405 -0.072742670 0.73620027
## ptratio 0.122422773 0.125987461 -0.20686483
## black -0.143035765 -0.007637357 0.10294691
## lstat 0.133606922 -0.351329562 0.32275482
## medv -0.002463537 -0.509212123 -0.21656505
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9558 0.0336 0.0106
Linear Discriminant Analysis is a pattern-recognition and dimensionality-reduction technique. It is used to identify patterns in the behavior of a set of continuous explanatory variables across 2 or more known categories/groups (the dependent variable), and to distill these different behaviors into a set of “discriminants” – specific combinations of values that the explanatory variables tend to take in each category/group.
For instance, imagine you’re an entrepreneur selling a single product that you’ve worked years on. To improve the success rate of your marketing efforts, you might gather all the available information on the people you’ve ever solicited in your career, and conduct an LDA to identify what the people who actually buy from you have in common. Say you’re able to access information on the following set of explanatory variables: age, income level, BMI, number of children, hair length, size of current home town – that’s 6 explanatory variables. You could then use LDA to determine which of those 6 variables are actually significant in differentiating buyers from non-buyers and which aren’t. The significant ones have an LD coefficient with a high absolute value, while the coefficients of the non-significant differentiators are closer to zero. In our hypothetical scenario, for example, LDA might reveal that the people who buy from you tend to be over the age of 30, have more than 3 kids, live in very small towns, and have medium-to-low income. That is their “linear discriminant”, and you can multiply your sales by focusing your marketing on people who match that profile.
In the exercise at hand, we’re trying to identify the discriminants of the 4 different levels of crime rate. I’m not particularly enamored with this categorization: I view this dependent variable as essentially continuous, and its conversion into a factor seems somewhat artificial to me. But the instructions are what they are, and obey we must. Here’s an arrow bi-plot of the LDA:
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(TrainingSet$crime.factor)
plot(lda.fit, dimen = 2, col=classes, pch=classes);lda.arrows(lda.fit, myscale = 1.5)
Frankly, I don’t get much out of this plot. It’s too messy and cluttered for me to fully wrap my head around.
Our next step is using the model for predictions. We’ll start by storing the actual crime categories of the testing set in a separate vector and then removing that column from the dataset, in order to make the prediction game more interesting:
correct.classes<-TestingSet$crime.factor
TestingSet<-TestingSet[,which(colnames(TestingSet)!="crime.factor")]
Then we make predictions in a familiar way…
lda.pred <- predict(lda.fit, newdata = TestingSet)
…followed by a cross-tabulation of the predicted and observed outcomes:
table(real = correct.classes, predicted = lda.pred$class)
## predicted
## real low low-ish high-ish high
## low 19 11 1 0
## low-ish 5 14 5 0
## high-ish 1 11 13 1
## high 0 0 1 20
This shows us that the model does well at predicting low and high crime rates – a useful model, in other words. The intermediate categories cause much more difficulty though.
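To attach a single number to that impression, we can compute the overall accuracy from the cross-tabulation (counts copied by hand from the table above; the diagonal holds the correct predictions):

```r
# Confusion matrix from above: rows = real classes, columns = predicted.
conf <- matrix(c(19, 11,  1,  0,
                  5, 14,  5,  0,
                  1, 11, 13,  1,
                  0,  0,  1, 20),
               nrow = 4, byrow = TRUE)
sum(diag(conf)) / sum(conf)  # overall accuracy: 66 / 102, about 0.65
```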
Cluster analysis is used to identify meaningful categories within a population. A meaningful category is a subgroup of the population whose members, in one way or another, behave similarly to other members of that same group, and dissimilarly from the members of the other subgroups.
As humans, we have a built-in pattern recognition firmware that helps us place the real-world phenomena that we encounter into pre-determined categories and respond to them accordingly. Therefore, the utility of cluster analysis might not be quite as intuitively obvious as that of a technique such as regression analysis or LDA. An example of a situation where cluster analysis may be useful: say we discover a new species of deep-sea fish. At the beginning, we will have very little knowledge of the scale of variation – morphological or otherwise – present in the species. We might even have stumbled upon an entirely new taxon of fish comprising more than one separate species! To find out, we would gather measurements on a set of characteristics from as large a sample of the population as possible – say, weight, length, number of scales, tooth length, etc. – and then run a cluster analysis to see how many meaningful subgroups the population seems to divide into. After doing so, we might summon a board of biologists to decide whether these subgroups represent a set of sub-species of a single new species, or a set of more than one new species. Then we would proceed to give them names, descriptions, and so on.
In the present exercise, we will use cluster analysis to see how many meaningful group distinctions emerge among the different Boston districts represented in our dataset. To do this, we need a version of the Boston dataset where crim is back in numeric format, and everything has been scaled:
Boston.adulterated<-Boston #Back up the version in which crime is a factor.
Boston<-as.data.frame(scale(Boston.unscaled)) #Scale the original version of Boston and make sure the result is a data frame:
head(Boston)
## crim zn indus chas nox rm
## 1 -0.4193669 0.2845483 -1.2866362 -0.2723291 -0.1440749 0.4132629
## 2 -0.4169267 -0.4872402 -0.5927944 -0.2723291 -0.7395304 0.1940824
## 3 -0.4169290 -0.4872402 -0.5927944 -0.2723291 -0.7395304 1.2814456
## 4 -0.4163384 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.0152978
## 5 -0.4120741 -0.4872402 -1.3055857 -0.2723291 -0.8344581 1.2273620
## 6 -0.4166314 -0.4872402 -1.3055857 -0.2723291 -0.8344581 0.2068916
## age dis rad tax ptratio black
## 1 -0.1198948 0.140075 -0.9818712 -0.6659492 -1.4575580 0.4406159
## 2 0.3668034 0.556609 -0.8670245 -0.9863534 -0.3027945 0.4406159
## 3 -0.2655490 0.556609 -0.8670245 -0.9863534 -0.3027945 0.3960351
## 4 -0.8090878 1.076671 -0.7521778 -1.1050216 0.1129203 0.4157514
## 5 -0.5106743 1.076671 -0.7521778 -1.1050216 0.1129203 0.4406159
## 6 -0.3508100 1.076671 -0.7521778 -1.1050216 0.1129203 0.4101651
## lstat medv
## 1 -1.0744990 0.1595278
## 2 -0.4919525 -0.1014239
## 3 -1.2075324 1.3229375
## 4 -1.3601708 1.1815886
## 5 -1.0254866 1.4860323
## 6 -1.0422909 0.6705582
Then we calculate the Euclidean distance between each town:
dist_eu<-dist(Boston)
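By default, dist() uses the Euclidean metric: the square root of the sum of squared coordinate differences. A toy check with two made-up points:

```r
# Two hypothetical 2-dimensional points forming a 3-4-5 right triangle:
m <- rbind(a = c(0, 3), b = c(4, 0))
as.numeric(dist(m))                 # 5
sqrt(sum((m["a", ] - m["b", ])^2))  # same thing by hand: 5
```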
Finally, we apply the clever algorithm shown in the Datacamp exercise in order to visually identify the optimal (most succinct and informative) number of clusters (groups) into which the dataset divides:
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')
Here we face the same dilemma as in stepwise regression: we can see that increasing the number of clusters decreases residual variance, which is the goal. But more clusters means more categorizations, i.e., more complexity, more explaining to do, and more cognitive load. Therefore, we will abide by the rule of thumb – the optimal number of clusters is the one that brings about the most radical change in residual variance – and declare that 2 is the winning number. If we were seriously pursuing this line of investigation, we would now examine the clusters, identify their shared characteristics, and perhaps give them informative names or labels. But those are not the instructions. The instructions are to visualize the clusters. Hence now we visualize the clusters:
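To make “the most radical change” a bit more concrete, we can print the successive drops in total WCSS directly. A self-contained sketch on a fresh, scaled copy of Boston (note two assumptions here: kmeans is run on the data matrix itself, which is what it actually expects, rather than on the distance matrix, and nstart = 10 restarts are used to avoid unlucky local optima):

```r
# Total within-cluster sum of squares for k = 1..5, with multiple restarts:
fresh <- scale(MASS::Boston)
set.seed(123)
twcss <- sapply(1:5, function(k) kmeans(fresh, centers = k, nstart = 10)$tot.withinss)
diff(twcss)  # successive drops; the largest drop marks the "elbow"
```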
km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)
I don’t get much out of this plot, either. There’s too much stuff jammed into a tight space.
We’ll now perform this analysis in reverse. We’ll take 4 clusters – this seems a relatively good number, judging from the second-to-last plot – and perform LDA on Boston with the 4 clusters as target classes, in order to try and identify their common characteristics.
km4<-kmeans(dist_eu, centers = 4)
Boston$cluster<-km4$cluster
(ld4.4c<-lda(cluster ~ ., data=Boston))
## Call:
## lda(cluster ~ ., data = Boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2707510 0.2173913 0.1897233 0.3221344
##
## Group means:
## crim zn indus chas nox rm
## 1 0.9759730 -0.48724019 1.0836745 0.07252643 1.2746122 -0.51504943
## 2 -0.2781708 -0.48236779 0.4449338 -0.27232907 0.1112885 -0.34491541
## 3 -0.3904431 1.29327757 -0.8829277 0.67093453 -0.7094969 1.05013774
## 4 -0.4026194 -0.02663978 -0.6910740 -0.27232907 -0.7285393 0.04717328
## age dis rad tax ptratio black
## 1 0.8854486 -0.9186628 1.3368585 1.35643451 0.5296467 -0.8273642
## 2 0.4730244 -0.4156930 -0.2959232 -0.05777454 0.4757260 0.1720595
## 3 -0.6459300 0.7595211 -0.6074231 -0.64709824 -0.9518678 0.3430417
## 4 -0.6830053 0.6053314 -0.5661684 -0.71996870 -0.2055960 0.3772414
## lstat medv
## 1 0.9917971 -0.7561841
## 2 0.2721432 -0.3240236
## 3 -0.8065795 1.1616548
## 4 -0.5422106 0.1700672
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.08347580 0.25229152 0.23494161
## zn -0.50086410 1.07427556 -0.73572327
## indus -0.19066361 -0.38532493 -0.51849315
## chas 0.05133571 0.66311801 -0.02000862
## nox -1.48554172 0.53651016 0.37267367
## rm 0.20818955 0.40857906 -0.18625561
## age -0.15978026 0.02494374 -0.66480469
## dis 0.22369038 0.20325447 0.52351112
## rad -0.22891423 -0.12421387 1.52917037
## tax -1.04859002 0.58543578 -0.69932224
## ptratio -0.45539412 -0.13690964 -0.60978717
## black 0.28206885 -0.13889043 -0.08150766
## lstat -0.43261121 0.62600094 -0.27284565
## medv -0.11825412 0.74838356 -0.33152714
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.8304 0.1374 0.0321
Judging from the coefficients of the Linear Discriminants (LDs), it appears that air pollution is by far the most significant discriminating variable between the clusters, followed by property tax rate. However, it is not entirely clear to me how these coefficients are to be interpreted and applied. There are at most min(K-1, p) discriminants, where K is the number of classes/clusters and p the number of predictors – with our 4 clusters, that means three. My first guess, going by the treatment of factors in regression analysis, was that one of the clusters serves as a reference cluster, with each Linear Discriminant differentiating one of the other clusters from it. But that is not how LDA works: each LD is a direction in predictor space, and the LDs are ranked by discriminating power – LD1 is the linear combination of predictors that separates the groups best, LD2 the best-separating direction uncorrelated with LD1, and so on. This explains why Proportion of Trace (the share of between-group variance a discriminant captures) is highest for LD1 and progressively much smaller for the other LDs. We already saw in the previous section that 2 is the optimal number of clusters, so the single direction separating those two main clusters is bound to dominate. Our current model has 4 clusters, and the drastically lower Proportion of Trace of the later discriminants reflects the diminishing returns of classifying the data beyond the initial 2-way split: the extra distinctions do increase the overall predictive power, but it is debatable whether this improvement is enough to justify the resulting increase in model complexity.
However, what exactly do the coefficients represent, and how are they calculated? In linear and logistic regression, we can manually predict the value of the response variable using the coefficients. Can we similarly use the linear discriminant coefficients to predict which cluster a given data point belongs to? If so, how exactly is this done? I have yet to find a full answer to this, either by Googling or in any of the course materials.
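As far as I can tell, the mechanics are roughly these: the coefficient matrix projects the (centered) predictors into discriminant space, and predict() then classifies each point by comparing it to the projected group means, weighted by the prior probabilities. A hedged sketch on the built-in iris data – the centering used below, the prior-weighted average of the group means, is what predict.lda appears to use internally:

```r
# Reproducing LDA discriminant scores by hand, on the iris data.
library(MASS)
fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
# Center on the prior-weighted average of the group means, then project
# onto the coefficient matrix to get the discriminant scores:
ctr <- colSums(fit$prior * fit$means)
scores <- scale(X, center = ctr, scale = FALSE) %*% fit$scaling
pred <- predict(fit)
all.equal(unname(scores), unname(pred$x))  # the hand-made scores match predict()'s
mean(pred$class == iris$Species)           # near-perfect separation on iris
```

So the coefficients are usable for manual prediction after all, just with an extra centering step; the class assignment itself then comes from comparing distances to the group centroids in discriminant space.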
At any rate, we can still use the Group Means data to improvise some hypotheses on the four clusters that have been identified:
Cluster 1: Lots of crime, lots of industry, lots of pollution, apartments are old and small, close proximity to central traffic routes and employment centers, and very few blacks in the population. Hmmm… this sounds like an old-fashioned white working-class neighborhood – exactly the kind of place that is run by archetypal gangsters in trench coats and fedoras. Think Al Capone, only in Boston instead of Chicago. In fact, if you’ve seen this film, then you know exactly the kind of place I’m envisioning.
Cluster 2: The main differences between this one and Cluster 1 are the dramatic reduction in crime and the fact that blacks are allowed in this neighborhood. Otherwise the two clusters seem to fit a similar profile. Dis and rad indicate that this is a type of suburb that tends to be a bit farther from downtown but is overall still quite conveniently located. Houses are still old and small, though not as old as in Cluster 1. Therefore, I conjecture that this cluster represents the 2nd layer, or ring, of the city that was built around the core once the core became too cramped to accommodate the growing population. These places are still relatively blue-collar but they are no longer run by gangsters like Cluster 1 is. The absence of racist gangsters enables the entry of non-white ethnic groups into the neighborhood.
Cluster 3: These are suburbs proper – sub-urban in the most authentic sense of the word. They represent the Outer Ring of a city – relatively quiet and peaceful places with significantly less economic and industrial activity than in the previous two clusters. There is more space, so houses are bigger and tend to be newer. People move here because of the increased peace and safety relative to downtown – typically when they have their first children. The flipside of the increased peace and safety is reduced access to many services and entertainment options – you have to drive or use public transportation to reach some of the shiniest and trendiest goodies. But couples with small children generally don’t care much for that kind of thing.
Cluster 4: These are also outer suburbs, but they differ from Cluster 3 in one important respect: there are no families. These are districts constructed to provide affordable housing to newcomers who are young, single, and seeking career opportunities. Apartments are new and small, so they cost more than in Clusters 1 and 2 but significantly less than in Cluster 3.
Now it’s time to draw one of those goofy arrow plots again – biplot is what I think it’s called. Sigh. Do I really have to?
luokat <- Boston$cluster
plot(ld4.4c, dimen = 2, col=luokat, pch=luokat);lda.arrows(ld4.4c, myscale = 1.5)
Try as I might, I just can’t extract a whole lot of useful information from this plot. Maybe I could if there were fewer variables and the picture didn’t get so cramped. But the instructions were what they were. Pretty much the only bits of data that I’m able to glean from this plot are these two:
1. Air pollution correlates with Cluster 1. Well, that’s to be expected of the core regions of the city, where the bulk of cars and businesses are.
2. Property Tax rates also correlate with Cluster 1. This one is harder to decipher. I imagine it might be because being close to downtown is valued highly, especially by businesses, so those properties are also taxed more heavily regardless of their condition.
As for which variables are the most influential linear separators of the clusters, I can’t answer that confidently, since I’m still unclear on how to interpret the Linear Discriminant coefficients. But if the Group Means are any indicator (and they darn well should be), then I’d say the biggest differentiators are tax, indus, rad and nox. And I would also say that these variables likely reflect the Downtown-Suburban dichotomy. This would also explain why the WCSS-by-Clusters plot earlier showed 2 to be the ideal number of clusters: it was probably capturing this same dichotomy, and our present 4-cluster model just adds a bit of extra nuance to it – at the cost of increased model complexity.
Another reason why it is inconvenient to have more clusters than strictly necessary is that spatial-geometrical visualization of the Linear Discriminants differentiating the clusters becomes impossible once the number of clusters exceeds four. Computer screens are 2D, so visualizing the third discriminant needed for a 4-cluster model already requires a 3D trick, and more than three discriminants cannot be represented geometrically at all – at least not in static images. To complete this chapter, we will now attempt to plot the three Linear Discriminants defined by the LDA model that we built earlier to discriminate between the 4 categories of crime rate. This 3D plot is a product of the plotly package:
model_predictors <- TrainingSet[, which(colnames(TrainingSet) != "crime.factor")]
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=TrainingSet$crime.factor)
As you can see, the colors of the balls represent the different crime rates. The High Crime group is quite separate from the other three groups, with little overlap, while the three other groups overlap significantly among themselves. Our next step is to draw a similar plot, this time coloring the balls according to the Cluster that they were assigned to in the previous exercise:
TrainingSet$four.clusters <- as.factor(km4$cluster[TrainingRows])
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=TrainingSet$four.clusters)
We can see here that there is significant correspondence between the unnamed clusters into which we divided the dataset and the four different categories of crime rate. The cluster that we hypothetically branded the Working-Class Gangster Neighborhood is also host to the majority of the High Crime districts. The equivalence is largely lost outside the High Crime group though – the motley crew formed by the remaining data points contains units from all four clusters, and their alignment is not very systematic. We can therefore make the tentative suggestion that high crime rate is a significant factor in the categorization of suburbs into distinct classes. But we’re talking high crime rate specifically. When crime rate drops below a certain threshold, it seems to lose its differentiating power.
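The correspondence described above can also be checked numerically with a cross-tabulation instead of eyeballing colors. Since the TrainingSet object isn’t reproduced here, this is a self-contained, analogous sketch on the full Boston data: a quantile-based crime factor against a fresh 4-cluster k-means solution (the counts will therefore not match the plot above exactly).

```r
# Cross-tabulate a crime-rate factor against k-means cluster assignments.
library(MASS)                       # Boston dataset
set.seed(1)
crime_cat <- cut(Boston$crim,
                 breaks = quantile(Boston$crim),
                 include.lowest = TRUE,
                 labels = c("low", "med_low", "med_high", "high"))
km <- kmeans(scale(Boston), centers = 4)
table(crime_cat, km$cluster)
```

A table like this makes the claim quantitative: if the high-crime rows concentrate in one cluster column while the lower categories scatter across several, that is exactly the pattern the 3D plot suggests.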
In this chapter we deal with data that represents countries, not people or towns. This data was collected to analyze correlations between what the UN calls ‘human development’ and a variety of other factors. Should be relatively interesting! We start by importing the dataset.
human <- read.table("C:/Users/juho/Desktop/IODS-project/IODS-project/data/human.txt", header=T, sep="\t",quote=NULL)
Then we’ll visualize it:
library(ggplot2)
library(GGally)
ggpairs(human, mapping = aes(alpha=0.3), lower = list(combo = wrap("facethist", bins = 20))) + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Many of the variables are highly correlated with each other, as is to be expected. Some of these correlations are self-evident, such as the one between life expectancy and maternal mortality. Adolescent birth rates seem negatively correlated with life expectancy. In this case I suspect it is life expectancy that drives adolescent birth rates and not vice versa – a high risk of dying young lends urgency to the endeavor of continuing the species, while a long life expectancy probably makes people want to take their time choosing the right partner and circumstances for breeding. Gross National Income and life expectancy both correlate with expected length of education. Here it is hard to say which variable in each pair is causing which.
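The pairwise correlations underlying the ggpairs panels can also be read off numerically with `cor()`. Since the human file isn’t bundled with R, this sketch uses the built-in swiss data (socio-economic indicators for Swiss provinces) as a stand-in:

```r
# Numeric counterpart of a ggpairs correlation panel; the built-in swiss data
# stands in for human here, so the values illustrate the method, not our data.
round(cor(swiss), 2)
```

On the actual human data frame the equivalent call would be `round(cor(human), 2)`, provided all its columns are numeric.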
PCA is used to try and summarize, or condense, the overall variation of a set of variables into a minimal number of “principal components”, i.e., bundles of mutually correlated variables. Each bundle must be orthogonal to (uncorrelated with) the other bundles, while all variables within one bundle must correlate with each other, positively or negatively. Thus PCA is similar to LDA. The difference is that where LDA aims to summarize differences between user-specified groups, PCA aims to summarize the overall variance. Let’s now perform PCA on the human dataset:
(pca<-prcomp(human))
## Standard deviations:
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
##
## Rotation:
## PC1 PC2 PC3 PC4
## Edu2.FM 5.607472e-06 -0.0006713951 -3.412027e-05 -2.736326e-04
## Labo.FM -2.331945e-07 0.0002819357 5.302884e-04 -4.692578e-03
## Edu.Exp 9.562910e-05 -0.0075529759 1.427664e-02 -3.313505e-02
## Life.Exp 2.815823e-04 -0.0283150248 1.294971e-02 -6.752684e-02
## GNI 9.999832e-01 0.0057723054 -5.156742e-04 4.932889e-05
## Mat.Mor -5.655734e-03 0.9916320120 1.260302e-01 -6.100534e-03
## Ado.Birth -1.233961e-03 0.1255502723 -9.918113e-01 5.301595e-03
## Parli.F 5.526460e-05 -0.0032317269 -7.398331e-03 -9.971232e-01
## PC5 PC6 PC7 PC8
## Edu2.FM 0.0022935252 2.180183e-02 -6.998623e-01 7.139410e-01
## Labo.FM -0.0022190154 3.264423e-02 -7.132267e-01 -7.001533e-01
## Edu.Exp -0.1431180282 9.882477e-01 3.826887e-02 7.776451e-03
## Life.Exp -0.9865644425 -1.453515e-01 -5.380452e-03 2.281723e-03
## GNI 0.0001135863 -2.711698e-05 8.075191e-07 -1.176762e-06
## Mat.Mor -0.0266373214 1.695203e-03 -1.355518e-04 8.371934e-04
## Ado.Birth -0.0188618600 1.273198e-02 8.641234e-05 -1.707885e-04
## Parli.F 0.0716401914 -2.309896e-02 2.642548e-03 2.680113e-03
(pca.summary<-summary(pca))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
I can’t make much sense of these results. It claims that only the first principal component (PC1) is meaningful. But this is probably because the analysis was done without scaling the dataset first. PCA assumes that variables with a high variance are more meaningful than ones with a small variance. The variables in human are measured in different units with different ranges – most are proportions that vary between 0 and 1, while others, such as GNI, use units like dollars per capita and range from zero upwards without bound. It is to be expected that this will confuse PCA. But let’s nonetheless comply with the instructions and draw a biplot using the first two principal components as axes:
prossat<-round(100*pca.summary$importance[2, ], digits = 1)
laabelitjaleibelit<-paste0(names(prossat), " (", prossat, "%)")
biplot(pca, cex = c(0.8, 1), col = c("blue", "red"), xlab = laabelitjaleibelit[1], ylab = laabelitjaleibelit[2], main="Human development (unscaled) by two principal components")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
This also seems to support the notion that performing PCA on variables using different units and ranges doesn’t make much sense. You can even see how R complains about the virtual non-existence of principal components other than the first: their effects are all but nullified by the disproportionate range of the variables dominating PC1.
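The scaling effect is easy to demonstrate on toy data (my own illustration, not part of the course code): two uncorrelated variables, one with roughly a thousand times the spread of the other.

```r
# Unscaled PCA lets the large-range variable soak up essentially all variance.
set.seed(42)
toy <- data.frame(small = rnorm(100), big = rnorm(100, sd = 1000))
summary(prcomp(toy))$importance["Proportion of Variance", ]
# After scaling, the variance splits roughly 50/50, as it should for two
# uncorrelated variables of equal (unit) variance.
summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", ]
```

In the unscaled run PC1 carries practically 100% of the variance purely because of the measurement scale, which is exactly the pathology we just saw with GNI in the human data.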
Now let’s perform the same analysis and plotting on a scaled version of the dataset, and admire the results:
human.scaled<-scale(human)
pca.scaled<-prcomp(human.scaled)
(pca.scaled.summary<-summary(pca.scaled))
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
rosentit<-round(100*pca.scaled.summary$importance[2, ], digits = 1)
akselit<-paste0(names(rosentit), " (", rosentit, "%)")
biplot(pca.scaled, cex = c(0.8, 1), col = c("darkgreen", "purple"), xlab = akselit[1], ylab = akselit[2], main="Human development (scaled) by two principal components")
There. Though the high density of data points makes the plot messy, we can now actually get at least SOME useful information out of it. Principal Component 1, i.e. the most important combined element of variance in the dataset, is strongly positively correlated with life expectancy and expected years of education, and strongly negatively correlated with the rates of teen pregnancy and maternal mortality. Principal Component 2 is strongly correlated with the level of female participation in politics and the workforce. That is why one end of this spectrum seems to consist almost exclusively of Arab countries, where women’s participation in public forums is heavily restricted.
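This verbal interpretation can be double-checked by ranking the loadings of each component, which `prcomp` stores in the `rotation` matrix. Since human isn’t bundled with R, this sketch again uses swiss as a stand-in:

```r
# Rank variables by their loading on PC1; the variables at either extreme of
# the sorted list are the ones driving the component. (swiss stands in for
# the human data, so the variable names differ from our dataset.)
p <- prcomp(swiss, scale. = TRUE)
sort(p$rotation[, "PC1"])
```

On our data, `sort(pca.scaled$rotation[, "PC1"])` should show Mat.Mor and Ado.Birth at one extreme and Life.Exp and Edu.Exp at the other, matching the biplot arrows.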
Now we shall attempt something similar to PCA, but with categorical variables. Our dataset is about people’s tea consumption habits. We begin by loading the necessary libraries – then we’ll import the data and survey it:
library(FactoMineR)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
That is a tad too many variables to visualize on one plot without cluttering it, so we’ll select only six of them. Before removing variables, however, we’ll back up the original dataset.
tea.backup<-tea
tea<-tea[,c("Tea","How","how","where","sugar","lunch")]
head(tea)
## Tea How how where sugar lunch
## 1 black alone tea bag chain store sugar Not.lunch
## 2 black milk tea bag chain store No.sugar Not.lunch
## 3 Earl Grey alone tea bag chain store No.sugar Not.lunch
## 4 Earl Grey alone tea bag chain store sugar Not.lunch
## 5 Earl Grey alone tea bag chain store No.sugar Not.lunch
## 6 Earl Grey alone tea bag chain store No.sugar Not.lunch
Now we’ll plot this thing using the code from Datacamp:
library(tidyr)
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped
Bars illustrating distributions of factor levels
These plots plainly show us that the most common way to take tea is a regular tea bag, bought in a regular chain store and drunk with nothing added. This is clearly a British dataset, since Earl Grey is the preferred variety. Most people don’t have tea with lunch, and use vs. non-use of sugar is almost evenly split. Next, we will apply MCA to this data and draw a biplot:
tea.mca<-MCA(tea,graph=F)
(mca.summary<-summary(tea.mca))
##
## Call:
## MCA(X = tea, graph = F)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## where | 0.702 0.681 0.055 |
## sugar | 0.065 0.001 0.336 |
## lunch | 0.000 0.064 0.111 |
## NULL
plot(tea.mca, invisible=c("ind"),habillage="quali")
The Dims apparently correspond to the Principal Components of PCA, and % of var and Cumulative % of var seem to correspond to Proportion of Variance and Cumulative Proportion. It appears that these dimensions do a relatively poor job of explaining the variance of the dataset – the first two together account for only 30% of the total variance, while the first two PCs in our previous exercise accounted for 70%.
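Those variance figures can also be pulled out of the MCA object programmatically rather than read off the printed summary. Since tea ships with FactoMineR, the analysis is cheap to re-run in a self-contained sketch:

```r
# Re-run the MCA on the same six-variable subset and extract the eigenvalue
# table: variance, % of variance, and cumulative % for each dimension.
library(FactoMineR)
data(tea)
tea_sub <- tea[, c("Tea", "How", "how", "where", "sugar", "lunch")]
mca <- MCA(tea_sub, graph = FALSE)
round(mca$eig[1:2, ], 3)   # first two dimensions only
```

The cumulative percentage in the second row should land at roughly 29.5%, matching the summary above.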
The factor levels unpackaged and tea bag both play a huge role in the first principal component, as shown by their massively significant v.test values (which are supposedly equivalent to z-scores). Both of them are levels of the same variable – namely how – and thus measure the same thing. I guess this is because whether you have your tea from a tea bag or unpackaged is generally an indicator of whether you’re having it at home or in a café – and that has implications for a number of other variables too. This also explains why, on the biplot, unpackaged and tea shop are the two items most different from the rest yet similar to each other: both are factor levels closely tied to the act of having tea outside your home. The same thing explains why, in the Categorical variables table, how and where are by far the two most important contributors to this principal component. They are flipsides of the same coin.
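The “flipsides of the same coin” claim is visible in a plain cross-tabulation of the two variables, without any MCA machinery:

```r
# Cross-tabulate packaging format against place of purchase; if the two
# variables largely encode the same distinction, the counts should pile up
# on matching level combinations (tea ships with the FactoMineR package).
library(FactoMineR)
data(tea)
table(tea$how, tea$where)
```

A strongly diagonal-looking table here would confirm that how and where carry largely redundant information, which is what their joint dominance of Dim.1 suggests.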
Tea type is the second-most important element in the first principal component, which is somewhat surprising. Within this variable, Earl Grey is the most influential factor level. This might suggest that people who drink this variety differ from the average population. An easy, if somewhat intellectually lazy, assumption is that the Earl Grey drinkers are stereotypical upper-class Englishmen with impeccable Oxford accents, whose culture and lifestyle are very different from the common man’s.